Improvement of three simultaneous speech recognition by using AV integration and scattering theory for humanoid

Authors

  • Kazuhiro Nakadai
  • Daisuke Matsuura
  • Hiroshi G. Okuno
  • Hiroshi Tsujino
Abstract

This paper presents improvements in the recognition of three simultaneous speech signals by a humanoid robot with a pair of microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds the number of microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is non-stationary due to interfering voices. To improve recognition of three simultaneous speech signals, two key ideas are introduced: acoustical modeling of the robot head by scattering theory, and two-layered audio-visual integration of both name and location, that is, speech and face recognition, and speech and face localization. Sound sources are separated in real time by an active direction-pass filter (ADPF), which extracts sound from a specified direction by using the interaural phase/intensity differences estimated by scattering theory. Since the features of sounds separated by the ADPF vary according to the sound direction, multiple Direction- and Speaker-dependent (DS-dependent) acoustic models are used. The system integrates the ASR results by using the sound direction, the speaker information obtained by face recognition, and the confidence measures of the ASR results to select the best one. The resulting system shows around a 10% improvement on average in the recognition of three simultaneous speech signals, where the three talkers were located 1 meter from the humanoid and separated from each other by 0 to 90 degrees at 10-degree intervals.
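The direction-pass idea can be sketched as follows: keep only the frequency bins whose observed interaural phase difference (IPD) lies close to the IPD predicted for the target direction, and suppress the rest. This is a minimal, hypothetical illustration rather than the paper's actual ADPF; the function name, the tolerance parameter, and the toy linear-delay IPD model are all assumptions.

```python
import numpy as np

def direction_pass_filter(left_spec, right_spec, target_ipd, tol=0.3):
    """left_spec, right_spec: complex spectra (one STFT frame each).
    target_ipd: predicted IPD per frequency bin for the wanted direction.
    Returns the left spectrum with non-matching bins zeroed out."""
    # observed IPD = phase(left) - phase(right)
    observed_ipd = np.angle(left_spec * np.conj(right_spec))
    # wrap the difference into [-pi, pi] before thresholding
    diff = np.angle(np.exp(1j * (observed_ipd - target_ipd)))
    mask = np.abs(diff) < tol
    return left_spec * mask

# toy example: a source whose interaural delay grows linearly with frequency
freqs = np.arange(1, 5)
delay = 0.2                       # assumed delay slope, radians per bin
left = np.exp(1j * freqs * 0.0)
right = np.exp(1j * freqs * delay)
target = -freqs * delay           # IPD predicted for this direction
out = direction_pass_filter(left, right, target)
```

Bins whose IPD matches the target direction pass unchanged; shifting `target_ipd` away from the observed IPD zeroes the output, which is the pass-band behavior the filter needs.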


Related articles

Three simultaneous speech recognition by integration of active audition and face recognition for humanoid

This paper addresses listening to three simultaneous talkers by a humanoid with two microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult, because the number of simultaneous talkers exceeds the number of microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is non-stationary due to interfering voices. Huma...


Simultaneous Speech Recognition Based on Automatic Missing Feature Mask Generation by Integrating Sound Source Separation

Our goal is to realize a humanoid robot that is capable of recognizing simultaneous speech. A humanoid robot in real-world environments usually hears a mixture of sounds, so three capabilities are essential for robot audition: sound source localization, separation, and recognition of the separated sounds. In particular, an interface between sound source separation and speech reco...


Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the blending of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...


Title of Dissertation: CORTICAL DYNAMICS OF AUDITORY-VISUAL SPEECH: A FORWARD MODEL OF MULTISENSORY INTEGRATION

Virginie van Wassenhove, Ph.D., 2004. Dissertation directed by David Poeppel, Ph.D., Department of Linguistics, Department of Biology, Neuroscience and Cognitive Science Program. In noisy settings, seeing the interlocutor's face helps to disambiguate what is being said. For this to hap...


A new method for robust speech recognition based on missing data using a bidirectional neural network

The performance of speech recognition systems is greatly reduced when speech is corrupted by noise. One common approach to robust speech recognition is the missing-feature method. In this approach, the components in the time-frequency representation of the signal (spectrogram) that exhibit a low signal-to-noise ratio (SNR) are tagged as missing, deleted, and then reconstructed from the remaining components and statistical ...
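The tagging step described above can be sketched as a binary reliability mask over the spectrogram: cells whose local SNR falls below a threshold are marked missing. Everything here (the function name, the threshold, and the toy data) is an illustrative assumption, not the paper's actual method.

```python
import numpy as np

def missing_feature_mask(spectrogram, noise_estimate, snr_threshold_db=0.0):
    """spectrogram, noise_estimate: power arrays of equal shape.
    Returns a binary mask: 1 = reliable, 0 = missing."""
    eps = 1e-12  # guard against log of zero
    snr_db = 10.0 * np.log10((spectrogram + eps) / (noise_estimate + eps))
    return (snr_db >= snr_threshold_db).astype(int)

# toy 2x2 spectrogram: first column well above the noise floor,
# second column buried in noise
speech = np.array([[10.0, 0.1],
                   [5.0, 0.2]])
noise = np.ones((2, 2))
mask = missing_feature_mask(speech, noise)
```

The resulting mask keeps only the first column; a recognizer using missing-feature theory would then either marginalize over or reconstruct the masked cells.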




Publication year: 2003